Current code supports Julia version 0.6.4, and with luck it may work in v0.7.
versioninfo()
The package PlotlyJS must be on version 10.2 and Blink must be on version 0.6.2. One can check package versions by doing Pkg.status("Blink"). If your version number exceeds these, you can tell julia to use an older version by pinning them as follows:
Pkg.status("Blink")
Pkg.pin("Blink", v"0.6.2")
Pkg.pin("PlotlyJS", v"0.10.2")
Pkg.build("Blink")
MendelKinship.jl is capable of calculating the theoretical kinship coefficient $\Phi_{ij}$ as long as a valid pedigree structure is provided. When SNP markers are available, MendelKinship.jl can also calculate empirical kinship coefficients using the MoM or GRM methods (see this paper for details). Here we recommend the MoM method because its estimates are more robust in the presence of rare alleles.
MendelKinship can optionally compare the empirical kinship and theoretical kinships to check suspect pedigree structures and reveal hidden relatedness. It can also reveal sample mixed ups or other laboratory errors that can lead to inaccurate empirical kinships. The result is saved in a table sorted in descending order. We optioanlly output 2 interactive plots that allow users to quickly pinpoint pairs with the greatest theoretical vs empirical deviance.
The input for all examples in this tutorial can be obtained from the free application Mendel v16 option 29a. These data were obtained from the 1000 genome project, containing 85 people and 253141 SNPs, half of which have $maf < 0.05$. Using these founders' genotype, we simulated 127 extra people, resulting in 27 pedigrees and 212 people. Although the 85 individuals are treated as founders, they were actually somewhat related, and this is reflected in the kinship comparison in the 2nd example below. For more information on this dataset, please see Mendel's documentation example 29.4.
MendelKinship additionally accepts PLINK binary format as input, in which case the triplets (data.bim, data.bed, data.fam) must all be present. In this tutorial, there are no examples that uses these to import pedigree and SNP information. But if available, one can import the data by specifying the following in the control file:
plink_input_basename = data
However, sometimes the .fam file contains non-unique person id (2nd column of .fam file) across different pedigrees, which is currently not permitted. A person's id cannot be repeated in other pedigrees, even if it is contextually clear that they are different persons. This will be fixed in the near future.
| Keyword | Default Value | Allowed value | Description |
|---|---|---|---|
kinship_output_file |
Kinship_Output_File.txt | true/false | OpenMendel generated output file with table of kinship coefficients |
repetitions |
1 | Integer | Repetitions for sharing statistics |
xlinked_analysis |
false | beelean | Whether markers are on the X chromosome |
compare_kinships |
false | boolean | Whether we want to compare theoretical vs empiric kinship |
kinship_plot |
"" | User defined file name | A user specified name for a plot comparing theoretical and empiric kinship value |
z_score_plot |
"" | User defined file name | A user specified name for a plot of fisher's z statistic. |
grm_method |
MoM | GRM, MoM | Method used for empiric kinship calculation. Defaults to MoM (default), but user could choose the more common GRM method instead. (Warning: Based on our experience, Fisher's z score is very unreliable if the GRM method is used for rare (maf < 0.2) snps) |
maf_threshold |
0.01 | Real number between 0 and 1 | The minor allele frequency threshold for the GRM computation |
deviant_pairs |
false | Integer less than $n(n+1)/2$ | Number of top deviant pairs (theoretical vs empiric kinship) the user wants to keep |
A list of OpenMendel keywords common to most analysis package can be found here
Recall what is a valid pedigree structure. Note that we require a header line. The extension .in have no particular meaning. Let's examine (the first few lines of) such an example:
;head -10 "Ped29a.in"
A control file gives specific instructions to MendelKinship. To perform theoretical kinship calculation, an minimal control file looks like the following:
;cat "control_just_theoretical_29a.txt"
We used the package Suppressor to hide warnings. They will be removed when we update MendelKinship to Julia version 1.0. However often informative warnings and/or MendelKinship messages will be printed, so it is best practice for new users to at least review the messages.
using MendelKinship, CSV
using Suppressor # used to hide warnings
@suppress begin
Kinship("control_just_theoretical_29a.txt")
end
MendelKinship should have generated the filejust_theoretical_output.txt in your local directory. One can directly open the file, or import into the Julia environment for ease of manipulation using the DataFrames package. The fourth column contains the desired theoretical kinship coefficient. The 5th column contains the (deterministically) estimated Delta7 matrix. The 6th through the 14 columns contain the (stochastically) estimated Jacquard's 9 identity coefficients.
When both pedigree structure and complete SNP information are available, we can compare theoretical/empirical kinship coefficients. In practice, however, we often have individuals without genotype information, but nevertheless must be included in the pedigree structure. MendelKinship does not handle this situation yet, but an analysis option that supports these data is being developed. For now you can impute genotypes but keep in mind that the relationship comparison for these individuals who lack all genotype information will not be meaningful.
The pedigree file is the same as the pedigree file in the previous example. The SNP definition file requires a header row, and should have approprietely placed commas. It may be informative to compare the following SNP definition file with the original "SNP_def29a.in" in Mendel Option 29a.
;head -10 "SNP_def29a_converted.txt"
The SNP data files in this case must be stored in PLINK BED file in SNP-major format, with an accompanying SNP definition file. For an explanation of what these are, see MendelBase documentation.
If your have "data.bim", "data.bed", "data.fam" (i.e. the 3 triplet of PLINK files), then you can replace the 3 fields snpdata_file, snpdefinition_file, and pedigree_file in the next step with just 1 field:
plink_input_basename = data.
The following control file tells MendelKinship to compare theoretical kinship and empirical kinship, and output 2 interactive plots stored in .html format.
;cat "control_compare_29a.txt"
using MendelKinship, CSV
using Suppressor # used to hide warnings
@suppress begin
Kinship("control_compare_29a.txt")
end
Running the previous control file should have generated:
The table is sorted in descending order of the largest deviance between the theoretical and empiric kinship. The last column lists the Fisher's Z statistic (i.e. the number of standard deviations away from mean).
Interesting remark: In the first row, person 26732 and 264 have 0 theoretical kinship (i.e. they are both founders) but their empirical kinship is pretty close to 0.125 = 1/8. That is, we discovered that these 2 people which we initially thought are unrelated, may be half siblings, grandparent-grandchild, or an avuncular pair. There may also have been a sample mix up. Another explanation is that we are only using one chromosome's worth of data and so the estimates of kinship may be imprecise.
The code below are not necessary for the user to understand. The 2 plots should automatically be generated in your directory. Unfortunately the plots cannot be imported into the notebook (because they are stored as .html, which is needed to enable the interactive sessions). So the code that generated the plots is pasted here to regerate the plots.
using PlotlyJS
result = CSV.read("kinship_file_output.txt")
name = Vector{String}(size(result, 1))
for i in 1:length(name)
name[i] = "Person1=" * string(result[i, 3]) * ", " * "Person2=" * string(result[i, 4])
end
function linescatter()
trace1 = scatter(;x=result[:theoretical_kinship],
y=result[:empiric_kinship], mode="markers",
name="empiric kinship", text=name)
trace2 = scatter(;x=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0],
y=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0],
mode="markers", name="marker for midpoint")
layout = Layout(;title="Compare empiric vs theoretical kinship",
hovermode="closest",
xaxis=attr(title="Theoretical kinship (θ)", showgrid=false, zeroline=false),
yaxis=attr(title="Empiric Kinship", zeroline=false))
data = [trace1, trace2]
plot(data, layout)
end
linescatter()
This interactive plot allows user to quickly identify which pairs of persons have an empirical kinship most deviated from their expected (theoretical) kinship. The midpoint is placed as an orange dot for interpretability. As an example, the first row in the table above is the highest point on the left most spread. Careful readers might observe that there is a wider spread on those with 0 expected theoretical kinship. This is expected, because most people are not related to each other, so we are making many more comparisons that have 0 expected kinship.
function two_hists()
trace1 = histogram(x=result[:fishers_zscore], text=name)
data = [trace1]
layout = Layout(barmode="overlay",
title="Z-score plot for Fisher's statistic",
xaxis=attr(title="Standard deviations"),
yaxis=attr(title="count"))
plot(data, layout)
end
two_hists()
Fisher's z transform should give us samples from a standard normal distribution $N(0, 1)$. We ploted this statistic in plot 2, and at first glance, the distribution is approximately normal. In Julia, we can easily verify this by computing some summary statistics:
my_zscore = convert(Vector{Float64}, result[:fishers_zscore])
mean(my_zscore), var(my_zscore)
using StatsBase
skewness(my_zscore), kurtosis(my_zscore)